CLAIMS 



What is claimed is: 

1. A method for evaluating and outputting a final clustering solution for 
a plurality of multi-dimensional data records, said data records having multiple, 
heterogeneous feature spaces represented by feature vectors, said method 
comprising: 

defining a distortion between two feature vectors as a weighted sum of 
distortion measures on components of said feature vector; 

clustering said multi-dimensional data records into k-clusters using a 
"convex programming" formulation; and 

selecting feature weights of said feature vectors. 
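
For illustration only, the weighted-sum distortion recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `weighted_distortion` and `squared_euclidean` are hypothetical, and squared Euclidean distance is assumed as one possible per-component distortion measure.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Measure = Callable[[Vector, Vector], float]


def squared_euclidean(a: Vector, b: Vector) -> float:
    """Assumed per-component distortion measure (squared Euclidean distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_distortion(
    x: Sequence[Vector],
    y: Sequence[Vector],
    weights: Sequence[float],
    measures: Sequence[Measure],
) -> float:
    """Weighted sum of per-component distortions between two feature vectors.

    x and y each hold one component vector per feature space; weights and
    measures hold one feature weight and one distortion measure per space.
    """
    if not (len(x) == len(y) == len(weights) == len(measures)):
        raise ValueError("one weight and one measure are needed per feature space")
    return sum(w * d(xc, yc) for w, d, xc, yc in zip(weights, measures, x, y))


# Two feature spaces, each with a two-dimensional component vector.
x = ([0.0, 0.0], [1.0, 0.0])
y = ([3.0, 4.0], [0.0, 1.0])
total = weighted_distortion(x, y, (0.25, 0.75), (squared_euclidean, squared_euclidean))
```

Each feature space may use a different measure; the feature weights control how much each space contributes to the combined distortion.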

2. The method according to claim 1, wherein said selecting of feature 
weights is optimized by an "objective" function to produce said solution of a 
final clustering that simultaneously minimizes average intra-cluster dispersion and 
maximizes average inter-cluster dispersion along all said feature spaces. 

3. The method according to claim 1, wherein said clustering includes 
initially applying a local minimum of said clustering. 

4. The method of claim 1, wherein said clustering comprises a k-means 
clustering algorithm. 

5. The method of claim 2, wherein said minimizing distortion of individual 
clusters includes taking said data records and iteratively determining Voronoi 
partitions until said "objective" function, between two successive iterations, is 
less than a specified threshold. 
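
For illustration only, the iterative partitioning recited above might look like the sketch below. It makes several assumptions that the claim does not state: a single feature space with squared Euclidean distortion, a stopping rule read as "the decrease in the objective between two successive iterations falls below the threshold", and hypothetical names (`kmeans`, `sq_dist`, `centroid_of`).

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (assumed distortion measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_of(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(
    points: Sequence[Point],
    initial_centroids: Sequence[Point],
    threshold: float = 1e-9,
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    centroids = [list(c) for c in initial_centroids]
    previous = float("inf")
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest centroid,
        # which yields the Voronoi partition for the current centroids.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: recompute the centroid of every non-empty cluster.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = centroid_of(members)
        # Objective: total distortion of records to their cluster centroids.
        objective = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        # Stop once the improvement between successive iterations is small.
        if previous - objective < threshold:
            break
        previous = objective
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the initial centroids, since this kind of iteration only converges to a local minimum of the objective.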

6. The method of claim 1, wherein said clustering comprises analyzing word 
data, and said feature vectors comprise multiple-word frequencies of said data 
records. 

7. The method of claim 1, wherein said clustering comprises analyzing data 
records having numerical and categorical attributes, and said feature vectors 
comprise linearly-scaled numerical attributes and each q-ary categorical feature 
using a 1-in-q representation of said data records. 
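
For illustration only, the two encodings recited above can be sketched as follows. The function names are hypothetical, and min-max scaling to the interval [0, 1] is an assumption; the claim says only that numerical attributes are linearly scaled.

```python
from typing import Hashable, List, Sequence


def one_in_q(value: Hashable, categories: Sequence[Hashable]) -> List[float]:
    """1-in-q representation: a q-dimensional vector with a single 1
    at the position of the observed category and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def linear_scale(values: Sequence[float]) -> List[float]:
    """Linearly scale a numerical attribute to [0, 1] (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


encoded = one_in_q("b", ["a", "b", "c"])
scaled = linear_scale([10.0, 20.0, 30.0])
```

Under this sketch, the scaled numerical attributes would form one feature space and the concatenated 1-in-q vectors another, each with its own distortion measure and feature weight.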

8. A method for evaluating and outputting a clustering solution for a plurality 
of multi-dimensional data records, said data records having multiple, 
heterogeneous feature spaces represented by feature vectors, said method 
comprising: 

defining a distortion between two said feature vectors as a weighted sum 
of distortion measures on components of said feature vector; 

clustering said multi-dimensional data records into k-clusters using a 
"convex programming" formulation of a generalized k-means clustering function; 
and 

selecting optimal feature weights of said feature vectors by an "objective" 
function to produce said solution of a final clustering that simultaneously 
minimizes average intra-cluster dispersion and maximizes average inter-cluster 
dispersion along all said feature spaces. 

9. The method of claim 8, wherein said clustering includes initially applying 
a local minimum of said clustering. 

10. The method of claim 8, wherein said minimizing distortion of individual 
clusters includes taking said data records and iteratively determining Voronoi 
partitions until said "objective" function, between two successive iterations, is 
less than a specified threshold. 

11. The method of claim 8, wherein said clustering comprises analyzing word 
data, and said feature vectors comprise multiple-word frequencies of said data 
records. 

12. The method of claim 8, wherein said clustering comprises analyzing data 
records having numerical and categorical attributes, and said feature vectors 
comprise linearly-scaled numerical attributes and each q-ary categorical feature 
using a 1-in-q representation of said data records. 

13. A computer system for data mining and outputting a final clustering 
solution, wherein said system includes a memory for storing a database having a 
plurality of multi-dimensional data records, each having multiple, heterogeneous 
feature spaces represented by feature vectors, said system including a processor 
for executing instructions comprising: 

defining a distortion between two feature vectors as a weighted sum of 
distortion measures on components of said feature vector; 

clustering said multi-dimensional data records into k-clusters using a 
"convex programming" formulation; and 

selecting feature weights of said feature vectors. 

14. The system of claim 13, wherein said instruction for selecting of said 
feature weights is optimized by implementing an "objective" function to produce 
said solution of a final clustering that simultaneously minimizes average 
intra-cluster dispersion and maximizes average inter-cluster dispersion along all 
said feature spaces. 

15. The system of claim 13, wherein said instruction of said clustering 
includes an instruction for initially applying a local minimum of said clustering. 

16. The system of claim 13, wherein said instruction for clustering 
includes instructions for implementing a k-means clustering algorithm. 

17. The system of claim 14, wherein said instruction for minimizing 
distortion of individual clusters includes taking said data records and iteratively 
determining Voronoi partitions until said "objective" function, between two 
successive iterations, is less than a specified threshold. 

18. The system of claim 13, wherein said instruction for clustering includes 
instructions for analyzing word data. 

19. The system of claim 13, wherein said instruction for clustering includes 
instructions for analyzing data records having numerical and categorical attributes. 

20. A program storage device readable by machine, tangibly embodying a 
program of instructions executable by said machine to perform a method for 
evaluating and outputting a final clustering solution from a set of data records 
having multiple, heterogeneous feature spaces represented as feature vectors, said 
method comprising: 

defining a distortion between two feature vectors as a weighted sum of 
distortion measures on components of said feature vector; 

clustering said multi-dimensional data records into k-clusters using a 
"convex programming" formulation; and 

selecting feature weights of said feature vectors. 

21. The device of claim 20, wherein said selecting of feature weights is 
optimized by an "objective" function to produce said solution of a final clustering 
that simultaneously minimizes average intra-cluster dispersion and maximizes 
average inter-cluster dispersion along all said feature spaces. 

22. The device of claim 20, wherein said clustering includes initially 
applying a local minimum of said clustering. 

23. The device of claim 20, wherein said clustering comprises a k-means 
clustering algorithm. 

24. The device of claim 21, wherein said minimizing distortion of 
individual clusters includes taking said data records and iteratively determining 
Voronoi partitions until said "objective" function, between two successive 
iterations, is less than a specified threshold. 

25. The device of claim 20, wherein said clustering comprises analyzing 
word data, and said feature vectors comprise multiple-word frequencies of said 
data records. 

26. The device of claim 20, wherein said clustering comprises analyzing 
data records having numerical and categorical attributes, and said feature vectors 
comprise linearly-scaled numerical attributes and each q-ary categorical feature 
using a 1-in-q representation of said data records. 